General
Plots, Packages, and Functions: tidyverse
Requirements:
The type of influencer your advertising company uses to brainwash users is categorized into 4 levels: Macro, Mega, Micro, and Nano. Load DataDrivenMarketing.csv and remove any rows that contain NA values in the Sales, TV, or Influencer columns only. Report how many rows are left after doing this.
There are many different ways to do this. Here is a simple method…
library(tidyverse)
# Load data
ddm <- read_csv("data/DataDrivenMarketing.csv") |>
drop_na(Sales, TV, Influencer)
nrow(ddm)
[1] 4539
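An equivalent base-R route is to filter with complete.cases() restricted to the three columns of interest. A sketch on a toy data frame (the values below are hypothetical; only the column names match the assignment's):

```r
# Toy data frame mimicking the relevant columns (hypothetical values)
ddm_toy <- data.frame(
  Sales      = c(100, NA, 300, 400),
  TV         = c(10, 20, NA, 40),
  Influencer = c("Mega", "Micro", "Nano", "Macro")
)
# Keep rows with no NA in Sales, TV, or Influencer only
keep <- complete.cases(ddm_toy[, c("Sales", "TV", "Influencer")])
ddm_toy <- ddm_toy[keep, ]
nrow(ddm_toy)  # 2 rows survive in this toy example
```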
Using the NA removed data, build an ordinary least-squares regression model with both TV and Influencer as predictors of Sales. Are each of its coefficients significantly different than 0? What is the multiple \(R^2\) of this model?
# Dummy coding
ddm$Mac <- ifelse(ddm$Influencer == "Macro", 1, 0)
ddm$Meg <- ifelse(ddm$Influencer == "Mega", 1, 0)
ddm$Mic <- ifelse(ddm$Influencer == "Micro", 1, 0)
# Model
mod2 <- lm(Sales ~ TV + Mac + Meg + Mic, data = ddm)
mod2sum <- summary(mod2)
rsq <- round(mod2sum$r.squared, 3)
# Results
mod2sum$coefficients
cat("\n")
rsq
              Estimate Std. Error   t value      Pr(>|t|)
(Intercept) 189.523112 1.75647337 107.89979  0.000000e+00
TV            2.913543 0.03394066  85.84225  0.000000e+00
Mac          72.191913 2.43900997  29.59886 3.444084e-176
Meg          45.080510 2.16053220  20.86547  2.139665e-92
Mic          29.603079 1.96000432  15.10358  2.524918e-50

[1] 0.812
All the coefficients are significantly different than 0.
\(R^2 = 0.812\)
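The manual dummy coding above is equivalent to letting lm() treatment-code a factor predictor itself. A minimal sketch with simulated data (simulated because the CSV is not reproduced here; all names and values are hypothetical) showing that the two approaches yield identical coefficients:

```r
set.seed(1)
# Simulated stand-ins for the assignment's columns (hypothetical values)
infl  <- factor(sample(c("Macro", "Mega", "Micro", "Nano"), 200, replace = TRUE))
tv    <- runif(200, 0, 100)
sales <- 190 + 3 * tv +
  c(Macro = 70, Mega = 45, Micro = 30, Nano = 0)[as.character(infl)] +
  rnorm(200, sd = 10)
# Manual dummies, with "Nano" as the reference level ...
mac <- as.numeric(infl == "Macro")
meg <- as.numeric(infl == "Mega")
mic <- as.numeric(infl == "Micro")
m_dummy  <- lm(sales ~ tv + mac + meg + mic)
# ... versus letting lm() treatment-code the factor directly
m_factor <- lm(sales ~ tv + relevel(infl, ref = "Nano"))
all.equal(unname(coef(m_dummy)), unname(coef(m_factor)))  # TRUE
```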
Conduct an F-test to evaluate whether Influencer significantly improved the fit of the model over one with just TV as a predictor. Which is the preferred model?
Use of anova() is prohibited.
Report the F-statistic, degrees of freedom, and p-value.
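The statistic assembled below is the standard change-in-\(R^2\) F-test for nested models:

\[
F = \frac{(R^2_{\text{full}} - R^2_{\text{reduced}})/df_1}{(1 - R^2_{\text{full}})/df_2},
\qquad df_1 = p_{\text{full}} - p_{\text{reduced}}, \quad df_2 = N - p_{\text{full}} - 1,
\]

where \(p\) counts the predictors in each model (here \(p_{\text{full}} = 4\): TV plus three dummies; \(p_{\text{reduced}} = 1\)).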
mod1 <- lm(Sales ~ TV, data = ddm)
mod1sum <- summary(mod1)
N <- nrow(ddm)
P_mod2 <- 4
P_diff <- 4 - 1
Rsq_diff <- mod2sum$r.squared - mod1sum$r.squared
# F-statistic to compare the models
F_stat <- ((N - P_mod2 - 1) / P_diff) * (Rsq_diff / (1 - mod2sum$r.squared))
# P-Value
df1 <- P_diff
df2 <- N - P_mod2 - 1
p <- pf(F_stat, df1, df2, lower.tail = FALSE)
# Results
paste0("F = ", round(F_stat, 5))
paste0("df1 = ", df1, "; df2 = ", df2)
paste0("p = ", round(p, 5))
[1] "F = 294.65871"
[1] "df1 = 3; df2 = 4534"
[1] "p = 0"
Adding ‘influencer’ significantly improves the model and is the preferred choice.
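Although anova() was off-limits for the assignment, it makes a useful sanity check that the manual \(R^2\)-based statistic is the same F that anova() reports. A sketch on simulated data (all names and values hypothetical):

```r
set.seed(2)
n <- 150
x <- rnorm(n)
g <- factor(sample(c("a", "b", "c", "d"), n, replace = TRUE))
y <- 2 * x + as.numeric(g == "b") + rnorm(n)
reduced <- lm(y ~ x)      # 1 predictor
full    <- lm(y ~ x + g)  # 1 + 3 dummy predictors = 4
r2_red  <- summary(reduced)$r.squared
r2_full <- summary(full)$r.squared
# Manual change-in-R^2 F: 4 predictors in the full model, 3 of them added
F_manual <- ((n - 4 - 1) / 3) * ((r2_full - r2_red) / (1 - r2_full))
F_anova  <- anova(reduced, full)$F[2]
all.equal(F_manual, F_anova)  # TRUE
```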
Using the preferred model from the previous question, create a plot of the residuals to evaluate homogeneity of variance. Is the assumption reasonable?
ggplot(ddm, aes(x = mod2$fitted.values, y = mod2$residuals)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 3, colour = 'red') +
labs(
x = "Fitted Values",
y = "Residuals"
)
Yes, the assumption is reasonable.
Using the preferred model, evaluate whether the residuals are normally distributed. Is the assumption reasonable?
ggplot(ddm, aes(sample = mod2$residuals)) +
stat_qq() +
stat_qq_line() +
labs(
x = "Theoretical Quantiles",
y = "Sample Quantiles"
)
Yes, the assumption is reasonable.
Some archaeologists theorize that ancient Egyptians interbred with several different immigrant populations over thousands of years. To see if there is any indication of changes in body structure that might have resulted, they measured skulls of male Egyptians from 5 different epochs.
The data can be found in “SkullsComplete.csv”. The column “mb” measures the maximum breadth of the skull in millimetres.
For the remaining questions, we will not be using the columns “bh”, “bl” or “nh”. Remove them from the dataframe and display the first 10 rows.
There are many different ways to remove unnecessary columns; the easiest (with the tidyverse) is to use the select() function and list which columns you want to keep.
library(tidyverse)
skulls <- read_csv("data/SkullsComplete.csv") |>
select(epoch, mb)
skulls[1:10, ]
# A tibble: 10 × 2
epoch mb
<chr> <dbl>
1 c4000BC 131
2 c4000BC 125
3 c4000BC 131
4 c4000BC 119
5 c4000BC 136
6 c4000BC 138
7 c4000BC 139
8 c4000BC 125
9 c4000BC 131
10 c4000BC 134
Create a barplot of the mean maximal breadth measured for each epoch in the “SkullsComplete.csv” data. Give the plot errorbars with classic 95% confidence intervals.
Order the epochs so that, left to right, they go from earliest (\(4000_\text{BC}\)) to latest (\(150_\text{AD}\)). Note that years classified as “B.C.” count backwards. E.g., \(3300_\text{BC}\) is more recent than \(4000_\text{BC}\).
Adjust the y-axis scale so that it goes from 120 at the bottom to 140 at the top.
Make the bars interesting colours (don’t use ggplot’s default colours).
Display a data frame that shows the group means, as well as the lower and upper boundaries of the confidence intervals. Do not display any other statistics in the data frame.
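The classic 95% confidence interval computed below is the usual t-based interval around each group mean:

\[
\bar{x} \pm t_{\alpha/2,\, n-1} \cdot \frac{s}{\sqrt{n}}, \qquad \alpha = 0.05,
\]

where \(n\), \(\bar{x}\), and \(s\) are computed separately within each epoch.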
# Factoring and ordering epoch
skulls$epoch <- factor(skulls$epoch,
levels = c(
"c4000BC",
"c3300BC",
"c1850BC",
"c200BC",
"cAD150"
)
)
# Getting statistics for plot
plot_data <- skulls |>
group_by(epoch) |>
summarise(
n = length(mb),
m = mean(mb),
df = n - 1,
alpha = 0.05,
t_crit = abs(qt(alpha / 2, df = df)),
se = sd(mb) / sqrt(n),
moe = t_crit * se,
ci_bot = m - moe,
ci_top = m + moe
)
plot_data |> select(epoch, m, ci_bot, ci_top)
# A tibble: 5 × 4
epoch m ci_bot ci_top
<fct> <dbl> <dbl> <dbl>
1 c4000BC 131.3667 129.4514 133.2820
2 c3300BC 132.3667 130.5706 134.1628
3 c1850BC 134.4667 133.1667 135.7666
4 c200BC 135.5 134.0365 136.9635
5 cAD150 136.1667 134.1688 138.1645
# Colour palette
palette <- hcl.colors(n = 7, palette = "Green-Brown")[2:6]
# Plot
ggplot(plot_data, aes(x = epoch, y = m)) +
geom_bar(stat = "identity", aes(fill = epoch), colour = "black") +
geom_errorbar(
stat = "identity",
aes(ymin = ci_bot, ymax = ci_top),
width = .25
) +
# Y-axis Scale Adjust
coord_cartesian(ylim = c(120, 140)) +
xlab("Epoch") +
ylab("Maximum Skull Breadth (mm)") +
theme(legend.position = "none") +
# Colours
  scale_fill_manual(values = palette)
Using the “SkullsComplete.csv” data, create an ordinary least-squares linear model predicting maximal breadth (mb) with coefficients that make the following comparisons:
Report the model’s formula using the obtained coefficients.
Given the methods shown in class, there are two ways to achieve this: The first way is to use the ifelse() function to create your contrasts/dummy coding:
# Dummy variables
skulls$cont1 <- ifelse(skulls$epoch == "c3300BC", 1, 0)
skulls$cont2 <- ifelse(skulls$epoch == "c1850BC", 1, 0)
skulls$cont3 <- ifelse(skulls$epoch == "c200BC", 1, 0)
skulls$cont4 <- ifelse(skulls$epoch == "cAD150", 1, 0)
# Model
skull_mod <- lm(mb ~ cont1 + cont2 + cont3 + cont4, data = skulls)
# skull_mod
Formula:
\(\hat{y} = 131.367 + 1 \cdot x_1 + 3.1 \cdot x_2 + 4.133 \cdot x_3 + 4.8 \cdot x_4\)
The second way is to manually set the contrasts in R:
#Look at the ordering of the factor levels
levels(skulls$epoch)
[1] "c4000BC" "c3300BC" "c1850BC" "c200BC"  "cAD150"
# Create n-1 (4) Contrasts
cont1 <- c(0, 1, 0, 0, 0)
cont2 <- c(0, 0, 1, 0, 0)
cont3 <- c(0, 0, 0, 1, 0)
cont4 <- c(0, 0, 0, 0, 1)
contrasts(skulls$epoch) <- cbind(cont1, cont2, cont3, cont4)
contrasts(skulls$epoch)
        cont1 cont2 cont3 cont4
c4000BC 0 0 0 0
c3300BC 1 0 0 0
c1850BC 0 1 0 0
c200BC 0 0 1 0
cAD150 0 0 0 1
# Model
skull_mod <- lm(mb ~ cont1 + cont2 + cont3 + cont4, data = skulls)
Formula:
\(\hat{y} = 131.367 + 1 \cdot x_1 + 3.1 \cdot x_2 + 4.133 \cdot x_3 + 4.8 \cdot x_4\)
Is the omnibus F-test from the previous question’s linear model statistically significant at an \(\alpha = 0.05\)? Report its value, degrees of freedom, and p-value.
To get accurate results, you will need to extract the F-stat values from the summary output.
# F-Stat, df1, and df2
mod_sum <- summary(skull_mod)
# P-value
p_val <- pf(mod_sum$fstatistic[1],
df1 = mod_sum$fstatistic[2],
df2 = mod_sum$fstatistic[3],
lower.tail = FALSE
)
# Results
paste0("F_stat = ", round(mod_sum$fstatistic[1], 6))
paste0("df 1 = ", mod_sum$fstatistic[2])
paste0("df 2 = ", mod_sum$fstatistic[3])
paste0("p_val = ", round(p_val, 6))
[1] "F_stat = 5.954613"
[1] "df 1 = 4"
[1] "df 2 = 145"
[1] "p_val = 0.000183"
Yes, it is significant.
What do the results you displayed in the previous question tell you?
Based on the planned contrasts/comparisons you used to evaluate the “SkullsComplete.csv” data, what epochs was there a significant difference between?
round(mod_sum$coefficients, 4)
            Estimate Std. Error  t value Pr(>|t|)
(Intercept) 131.3667 0.8389 156.6006 0.0000
cont1 1.0000 1.1863 0.8429 0.4007
cont2 3.1000 1.1863 2.6131 0.0099
cont3 4.1333 1.1863 3.4841 0.0007
cont4 4.8000 1.1863 4.0461 0.0001
There is a significant difference between the following epochs:
\(1850_{\text{BC}}\) & \(4000_{\text{BC}}\)
\(200_{\text{BC}}\) & \(4000_{\text{BC}}\)
\(150_{\text{AD}}\) & \(4000_{\text{BC}}\)
For the “SkullsComplete.csv” data, assume all the classic assumptions of an OLS model are satisfied. Calculate an omnibus F-test manually without using lm() or aov(). Report the following . . . .
Round displayed outputs to 6 decimal places for everything except the degrees of freedom.
These results should be identical to the earlier F-statistic and p-value you obtained. If you get a different result, you have done something wrong somewhere somehow.
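The manual computation below follows the standard one-way ANOVA decomposition:

\[
SS_T = \sum_{i}(y_i - \bar{y})^2, \qquad
SS_M = \sum_{g} n_g(\bar{y}_g - \bar{y})^2, \qquad
SS_R = SS_T - SS_M,
\]

\[
F = \frac{SS_M/(k-1)}{SS_R/(N-k)},
\]

where \(k = 5\) epochs and \(N\) is the total number of skulls.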
# Grand mean
grand_m <- mean(skulls$mb)
# Total sum of squares
SST <- sum((skulls$mb - grand_m)^2)
# Model sum of squares
SSM <- sum(plot_data$n * (plot_data$m - grand_m)^2)
# Sum of squared residuals
SSR <- SST - SSM
# R^2
R_sq <- SSM/SST
# Degrees of Freedom
df_t <- nrow(skulls) - 1
df_m <- 5 - 1
df_r <- df_t - df_m
# Mean Squares
MSm <- SSM / df_m
MSr <- SSR / df_r
# F-stat
F_stat <- MSm / MSr
# P-value
p <- pf(F_stat, df1 = df_m, df2 = df_r, lower.tail = FALSE)
# Results
paste0("Grand Mean = ", round(grand_m, 6))
paste0("Total Sum of Squares = ", round(SST, 6))
paste0("Model Sum of Squares = ", round(SSM, 6))
paste0("Model Mean Squares = ", round(MSm, 6))
paste0("Residual Mean Squares = ", round(MSr, 6))
paste0("Multiple R^2 = ", round(R_sq, 6))
paste0("F = ", round(F_stat, 6))
paste0("df1 = ", df_m, ", df2 = ", df_r)
paste0("p = ", round(p, 6))
[1] "Grand Mean = 133.973333"
[1] "Total Sum of Squares = 3563.893333"
[1] "Model Sum of Squares = 502.826667"
[1] "Model Mean Squares = 125.706667"
[1] "Residual Mean Squares = 21.110805"
[1] "Multiple R^2 = 0.141089"
[1] "F = 5.954613"
[1] "df1 = 4, df2 = 145"
[1] "p = 0.000183"
Does the model you created for the “SkullsComplete.csv” data violate the normality assumption?
Note: There are two different ways to check this; either is correct.
Option 1: Check the model’s residuals
ggplot(mapping = aes(sample = mod_sum$residuals)) +
stat_qq(size = 3) +
stat_qq_line() +
labs(
x = "Theoretical Quantiles",
y = "Sample Quantiles"
)
Option 2: Check the normality of the individual epoch levels
# Colour palette
palette <- hcl.colors(n = 7, palette = "Green-Brown")[2:6]
# this is for colouring my plot pretty colours
ggplot(skulls, aes(sample = mb)) +
stat_qq(
shape = 21,
colour = "black",
aes(fill = epoch),
size = 2
) +
stat_qq_line() +
facet_wrap(~ epoch, scales = "free", nrow = 1) +
scale_fill_manual(values = palette) +
labs(
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
  theme(legend.position = "none")
The data seem to be acceptably normally distributed.
If you chose Option 2, there needs to be a separate Q-Q plot for each group because they are independent groups (i.e., the distribution of each may be distinct). You should not collapse them into a single Q-Q plot.
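A numeric complement to the Q-Q plots (not asked for in the question) is a Shapiro-Wilk test on the model’s residuals. A sketch using R’s built-in mtcars data, since the skulls CSV is not reproduced here:

```r
# Shapiro-Wilk test on a model's residuals (illustrative model on mtcars)
mod <- lm(mpg ~ wt, data = mtcars)
sw  <- shapiro.test(residuals(mod))
sw$p.value  # compare against alpha = .05; small p suggests non-normal residuals
```

Formal tests like this get very sensitive at large sample sizes, so the visual Q-Q check above remains the primary diagnostic.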